Explorations of the Orcas distribution
Temporal Patterns
The dataset spans 2017-01-05 to 2024-10-06, a total of 2,832 days counting both endpoints (≈ 7.75 years).
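The span quoted above can be checked with a short base R calculation. This is only a sketch: the two dates come from the text, while the variable names and the 365.25-day year are assumptions made for illustration.

```r
# First and last encounter dates as stated in the report.
start <- as.Date("2017-01-05")
end   <- as.Date("2024-10-06")

# Date subtraction gives the gap in days; add 1 to count both endpoints.
span_days  <- as.numeric(end - start) + 1

# Convert to years using an average year length of 365.25 days (assumption).
span_years <- span_days / 365.25

print(span_days)
print(round(span_years, 2))
```

Without the `+ 1`, the difference is 2,831 days, so the reported 2,832 corresponds to an inclusive count.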
Figure 2. Monthly time-series of orca encounters (x-axis: YYYY-MM; y-axis: encounters per month), showing month-to-month variation in observations.
Figure 3. Boxplots of monthly encounter counts across years (x-axis: month abbreviation; y-axis: encounters). Each box summarizes the distribution of counts for that month over all years in the dataset.
Result. Both figures show a pronounced seasonal peak in September. Across years, the median number of September encounters is ≈18, whereas the median for most other months is below 10.
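A minimal sketch of how per-month medians across years could be computed is given below. The data frame `encounters` and its `date` column are hypothetical toy data, not the report's dataset, and base R is used only to keep the example dependency-free.

```r
# Hypothetical toy data: one row per encounter (column name `date` is assumed).
encounters <- data.frame(
  date = as.Date(c("2021-08-03", "2021-09-01", "2021-09-12", "2021-09-20",
                   "2022-08-15", "2022-08-30", "2022-09-05", "2022-09-06",
                   "2022-09-18", "2022-09-25"))
)

encounters$year  <- format(encounters$date, "%Y")
# Fix the factor levels so that months without encounters still appear as 0.
encounters$month <- factor(format(encounters$date, "%m"),
                           levels = sprintf("%02d", 1:12))

# Year-by-month table of encounter counts.
monthly <- table(encounters$year, encounters$month)

# Median count per calendar month across years (the centre line of each box).
monthly_median <- apply(monthly, 2, median)

print(monthly_median)
```

One caveat with this approach: a partially observed year contributes zeros for the months it does not cover, which pulls those medians down, so such months may need to be excluded before summarizing.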
Figure 4. Line chart of the share of annual encounters occurring in September, by year (x-axis: year; y-axis: September share of the yearly total).
Interpretation. The September share varies across 2017–2024 but never falls below ~15%, which is well above September’s calendar share of the year (~8.3%, or one month in twelve). Taken together with Figures 2–3, this indicates a clear seasonal pattern with a consistent peak in September.
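The comparison between the September share and the one-in-twelve baseline can be sketched as follows. The dates are hypothetical toy values and the variable names are assumptions; only the 1/12 baseline comes from the text.

```r
# Hypothetical encounter dates spanning two years.
dates <- as.Date(c("2021-03-10", "2021-06-02", "2021-09-01", "2021-09-12",
                   "2022-01-20", "2022-05-05", "2022-07-14", "2022-09-05",
                   "2022-09-18", "2022-11-30"))

year    <- format(dates, "%Y")
is_sept <- format(dates, "%m") == "09"

# The mean of a logical vector is the proportion of TRUE values,
# so this gives the September share of each year's encounters.
sept_share <- tapply(is_sept, year, mean)

# Share expected if encounters were spread evenly over the twelve months.
baseline <- 1 / 12

print(round(sept_share, 3))
print(sept_share > baseline)
```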
Figure 5. Histogram of encounter durations (x-axis: duration in 1,000-second bins; y-axis: number of encounters).
Interpretation. The distribution is right-skewed with a long tail of very long encounters. Most encounters fall between 1,000 and 8,000 seconds (roughly 17 minutes to 2.2 hours), with the modal bin at 3,000–4,000 seconds (about 50–67 minutes; 86 encounters).
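The binning behind a histogram like this can be sketched in base R. The raw strings below are hypothetical, and their layout (a leading number of seconds followed by `s` and a human-readable note) is an assumption about how the duration variable might look, not a confirmed description of the dataset.

```r
# Hypothetical raw duration strings (the format is an assumption).
raw <- c("1500s (~25 minutes)", "3600s (~1 hours)", "3300s (~55 minutes)",
         "7200s (~2 hours)", "21000s (~5.83 hours)")

# Extract the leading integer that precedes "s" with a regular expression.
seconds <- as.numeric(sub("^\\s*(-?[0-9]+)s.*$", "\\1", raw))

# Left-closed 1,000-second bins: [0, 1000), [1000, 2000), ...
# One extra break is added so the maximum value still falls inside a bin.
breaks <- seq(0, ceiling(max(seconds) / 1000) * 1000 + 1000, by = 1000)
bins   <- cut(seconds, breaks = breaks, right = FALSE, dig.lab = 6)

# Count encounters per bin and find the most populated bin.
counts    <- table(bins)
modal_bin <- names(counts)[which.max(counts)]

print(modal_bin)
print(seconds / 60)  # the same durations in minutes
```

Setting `right = FALSE` makes each bin include its lower edge, so an encounter of exactly 3,000 seconds is counted in the 3,000–4,000 bin rather than the one below it.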
Spatial Patterns
Figure 6. Map of orca encounter locations.
Interpretation. Encounters are not uniformly distributed; instead, they show clear spatial clustering with noticeable gaps between clusters. The map also highlights an outlying point north of 50.0° N, which may reflect either a genuine distant sighting or a data-entry/positioning error. Overall, Figure 6 provides a concise overview of the spatial pattern of encounters.
Figure 7. Hexbin heat map of encounter hotspots (hex size ≈ 5 km). Darker hexagons indicate higher encounter counts.
Interpretation. Consistent with Figure 6, encounters are concentrated within 48.0°–49.0° N and 123.0°–123.5° W, indicating a clear spatial core rather than a uniform distribution.
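The spatial statements above can be made concrete with a small sketch. The coordinates are hypothetical toy points, the column names `lat` and `lon` are assumptions, and the 111.32 km-per-degree figure is a common approximation rather than an exact geodesic value.

```r
# Hypothetical encounter coordinates (decimal degrees; west is negative).
pts <- data.frame(
  lat = c(48.52, 48.47, 48.61, 48.35, 49.20, 50.42),
  lon = c(-123.15, -123.08, -123.20, -123.45, -123.90, -125.10)
)

# Core box from the text: 48.0-49.0 N and 123.0-123.5 W.
in_core <- pts$lat >= 48.0 & pts$lat <= 49.0 &
           pts$lon >= -123.5 & pts$lon <= -123.0
core_share <- mean(in_core)  # proportion of points inside the core box

# Flag candidate outliers north of 50 N for manual review.
far_north <- pts$lat > 50

# Approximate size of a 5 km cell in degrees at the core's mid-latitude.
km_per_deg_lat <- 111.32  # rough length of one degree of latitude
mid_lat        <- 48.5
deg_lat_5km    <- 5 / km_per_deg_lat
deg_lon_5km    <- 5 / (km_per_deg_lat * cos(mid_lat * pi / 180))

print(core_share)
print(pts[far_north, ])
print(c(lat = deg_lat_5km, lon = deg_lon_5km))
```

At this latitude a degree of longitude is noticeably shorter than a degree of latitude, so a cell that is about 5 km on each side spans more degrees east to west than north to south. Binning in a projected coordinate system avoids this distortion.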
Interpretation. The reproducible map clarifies the coastline geometry behind Figure 6: the arc-shaped pattern of encounters follows the bay shoreline, with the majority of sightings clustered within the bay rather than offshore. This indicates that the observed spatial pattern is largely coastline-constrained.
Summary of Encounters
Figure 8. Top 20 observers by number of encounters. Dave Ellifrit records the most encounters. Mark Malleson ranks second with ~300 encounters, roughly twice as many as the third-ranked observer. Together, Dave Ellifrit and Mark Malleson stand out as the most active observers in the dataset.
Figure 9. Top 10 vessels by number of encounters. Orcinus records the most encounters overall, while Mike1 ranks second with ~300 encounters. Both Orcinus and Mike1 clearly stand out as the most active vessels in the dataset.
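Rankings such as those in Figures 8 and 9 come down to counting rows per category and sorting. A minimal base R sketch with hypothetical observer labels (not the real names or counts) is:

```r
# Hypothetical observer labels, one per encounter.
observers <- c("Observer A", "Observer B", "Observer A", "Observer C",
               "Observer A", "Observer B")

# Count encounters per observer and sort from most to least active.
counts <- sort(table(observers), decreasing = TRUE)

# Keep the top N (here N = 2; the figures use 20 and 10).
top2 <- head(counts, 2)

# Ratio of the second-ranked to the third-ranked count.
ratio <- counts[[2]] / counts[[3]]

print(top2)
print(ratio)
```

The same pattern applies to vessels by swapping the input vector.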
Text Exploration of Encounters
Figure 10 shows that the most frequent word in the encounter summaries is “south” (1,645 occurrences), closely followed by “island” (1,633 occurrences).
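Word frequencies of this kind can be sketched in base R as below. The two summaries and the short stop-word list are hypothetical, and this is a simplified stand-in for a full text-mining workflow rather than the exact procedure used for Figure 10.

```r
# Hypothetical encounter summaries.
summaries <- c("The whales headed south past the island.",
               "South of the island, the group turned north.")

# Tiny illustrative stop-word list (an assumption, far from complete).
stop_words <- c("the", "of", "a", "and", "past")

# Lower-case the text and split on anything that is not a letter.
words <- unlist(strsplit(tolower(summaries), "[^a-z]+"))

# Drop empty strings and stop words.
words <- words[nzchar(words) & !words %in% stop_words]

# Count each word and sort by frequency.
freq <- sort(table(words), decreasing = TRUE)

print(head(freq, 5))
```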
# A tibble: 20 × 2
   bigram               n
   <chr>            <int>
 1 san juan           389
 2 snug harbor        339
 3 haro strait        275
 4 race rocks         240
 5 juan island        205
 6 hundred yards      195
 7 heading north      179
 8 de fuca            168
 9 juan de            168
10 victoria harbour   167
11 half mile          160
12 heading south      148
13 morning star       139
14 island shoreline   136
15 false bay          135
16 kellett bluff      127
17 constance bank     117
18 quarter mile       112
19 boundary pass      109
20 lime kiln          106
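From the table, the most frequent bigram in the encounter summaries is “San Juan” (389 occurrences), followed by “Snug Harbor” (339) and “Haro Strait” (275). These terms are place names in the study area: “San Juan” denotes the San Juan Islands (an archipelago), Snug Harbor is a locality on San Juan Island, and Haro Strait is the strait adjacent to San Juan Island. Because bigrams are overlapping two-word windows, some entries are fragments of longer names: “juan island” is part of “San Juan Island”, and “juan de” and “de fuca” (168 occurrences each) together make up “Juan de Fuca”.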
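The way overlapping fragments arise can be seen in a minimal bigram-counting sketch. The summaries, the stop-word list, and the helper `bigrams_of()` are all hypothetical and written in base R for illustration; they are not the code that produced the table above.

```r
# Hypothetical encounter summaries.
summaries <- c("Encounter began off San Juan Island in Haro Strait.",
               "Whales seen near San Juan Island heading north.")

# Tiny illustrative stop-word list (an assumption).
stop_words <- c("the", "of", "in", "off", "near", "a", "and")

# Hypothetical helper: return the bigrams of one text, dropping any pair
# in which either word is a stop word.
bigrams_of <- function(text) {
  w <- unlist(strsplit(tolower(text), "[^a-z]+"))
  w <- w[nzchar(w)]
  if (length(w) < 2) return(character(0))
  first  <- w[-length(w)]  # every word except the last
  second <- w[-1]          # every word except the first
  keep <- !(first %in% stop_words) & !(second %in% stop_words)
  paste(first[keep], second[keep])
}

# Build bigrams per summary (so pairs never span two summaries), then count.
bigram_counts <- sort(table(unlist(lapply(summaries, bigrams_of))),
                      decreasing = TRUE)

print(bigram_counts)
```

Each mention of “San Juan Island” contributes both “san juan” and “juan island”, which is why the two appear together near the top of such a table.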
Resources
The materials used for this report are:
Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R, Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V, Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019). “Welcome to the tidyverse.” Journal of Open Source Software, 4(43), 1686. doi:10.21105/joss.01686.
Wickham H (2023). conflicted: An Alternative Conflict Resolution Strategy. R package version 1.2.0, https://CRAN.R-project.org/package=conflicted.
Ryan J (2025). orcas: Scrape and Visualize Orca Sighting Data. R package version 0.0.0.9000, commit 08b3808ee4f5c9f1a25cbedea9e9d8316322ed1c, https://github.com/jadeynryan/orcas.
Wickham H (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York.
Wickham H, François R, Henry L, Müller K, Vaughan D (2023). dplyr: A Grammar of Data Manipulation. R package version 1.1.4, https://CRAN.R-project.org/package=dplyr.
Grolemund G, Wickham H (2011). “Dates and Times Made Easy with lubridate.” Journal of Statistical Software, 40(3), 1-25. https://www.jstatsoft.org/v40/i03/.
Wickham H, Pedersen T, Seidel D (2023). scales: Scale Functions for Visualization. R package version 1.3.0, https://CRAN.R-project.org/package=scales.
Wickham H (2023). stringr: Simple, Consistent Wrappers for Common String Operations. R package version 1.5.1, https://CRAN.R-project.org/package=stringr.
Pebesma E, Bivand R (2023). Spatial Data Science: With Applications in R. Chapman and Hall/CRC. doi:10.1201/9780429459016.
Pebesma E (2018). “Simple Features for R: Standardized Support for Spatial Vector Data.” The R Journal, 10(1), 439-446. doi:10.32614/RJ-2018-009.
Hahsler M, Piekenbrock M (2025). dbscan: Density-Based Spatial Clustering of Applications with Noise (DBSCAN) and Related Algorithms. R package version 1.2.2, https://CRAN.R-project.org/package=dbscan.
Dunnington D (2023). ggspatial: Spatial Data Framework for ggplot2. R package version 1.1.9, https://CRAN.R-project.org/package=ggspatial.
Garnier S, Ross N, Rudis R, Camargo AP, Sciaini M, Scherer C (2024). viridis(Lite) - Colorblind-Friendly Color Maps for R. viridis package version 0.6.5.
Pedersen T (2025). ggforce: Accelerating ‘ggplot2’. R package version 0.5.0, https://CRAN.R-project.org/package=ggforce.
Cheng J, Schloerke B, Karambelkar B, Xie Y (2024). leaflet: Create Interactive Web Maps with the JavaScript ‘Leaflet’ Library. R package version 2.2.2, https://CRAN.R-project.org/package=leaflet.
Wickham H, Vaughan D, Girlich M (2024). tidyr: Tidy Messy Data. R package version 1.3.1, https://CRAN.R-project.org/package=tidyr.
Silge J, Robinson D (2016). “tidytext: Text Mining and Analysis Using Tidy Data Principles in R.” Journal of Open Source Software, 1(3). doi:10.21105/joss.00037.
The link to my use of ChatGPT for help on this assignment is:
- https://chatgpt.com/share/68a2fc3f-95a4-8000-bc1f-45d65b2839e4 ChatGPT helped me construct the regex used to process the duration variable, and helped me build a reproducible map from the latitude and longitude values.